Required packages and the data are as follows:
library(ggplot2)
library(dplyr)
library(tidyverse)
library(data.table)
library(plotly)
library(RColorBrewer)
library(FactoMineR)
library(hrbrthemes)
library(RColorBrewer)
train_x<-fread( file= "data/UWave-20201126T225338Z-001/UWave/uWaveGestureLibrary_X_TRAIN", sep=" ")
train_y<-fread( file= "data/UWave-20201126T225338Z-001/UWave/uWaveGestureLibrary_Y_TRAIN", sep=" ")
train_z<-fread( file= "data/UWave-20201126T225338Z-001/UWave/uWaveGestureLibrary_Z_TRAIN", sep=" ")
Raw data is prepared for the analysis and the long format is obtained to be used in some parts.
train_x_df<-as.data.frame(train_x)
train_y_df<-as.data.frame(train_y)
train_z_df<-as.data.frame(train_z)
## formatting x data
names(train_x_df)[1]<-"class"
names(train_x_df)[2:316]<-1:315
train_x_df<-train_x_df%>%
mutate(time_series_id=1:896)%>%
mutate(class=factor(ifelse(class==1, "a",
ifelse(class==2, "b",
ifelse(class==3, "c",
ifelse(class==4, "d",
ifelse(class==5,"e",
ifelse(class==6,"f",
ifelse(class==7, "g",
"h"))))))),
levels=c("a","b","c","d","e","f","g","h")))
x_raw<-train_x_df[-c(1,317)] %>% setNames(paste0('X.', names(.)))
y_raw<-train_y_df[-c(1,317)] %>% setNames(paste0('Y.', names(.)))
z_raw<-train_z_df[-c(1,317)] %>% setNames(paste0('Z.', names(.)))
mds_data<-cbind(train_x_df[c(317,1)],x_raw,y_raw,z_raw)
x_formatted<-pivot_longer(
train_x_df,
cols=-c(time_series_id,class),
names_to = "time_index",
values_to = "X",
)
x_formatted<-x_formatted%>%transform(time_index=as.integer(time_index))%>%relocate(class, .after=X)%>%arrange(time_series_id,class)
## formatting y data
train_y_df<-as.data.frame(train_y)
names(train_y_df)[1]<-"class"
names(train_y_df)[2:316]<-1:315
train_y_df<-train_y_df%>%
mutate(time_series_id=1:896)%>%
mutate(class=factor(ifelse(class==1, "a",
ifelse(class==2, "b",
ifelse(class==3, "c",
ifelse(class==4, "d",
ifelse(class==5,"e",
ifelse(class==6,"f",
ifelse(class==7, "g",
"h"))))))),
levels=c("a","b","c","d","e","f","g","h")))
y_formatted<-pivot_longer(
train_y_df,
cols=-c(time_series_id,class),
names_to = "time_index",
values_to = "Y",
)
y_formatted<-y_formatted%>%transform(time_index=as.integer(time_index))%>%relocate(class, .after=Y)%>%
arrange(time_series_id,class)
## formatting z data
names(train_z_df)[1]<-"class"
names(train_z_df)[2:316]<-1:315
train_z_df<-train_z_df%>%
mutate(time_series_id=1:896)%>%
mutate(class=factor(ifelse(class==1, "a",
ifelse(class==2, "b",
ifelse(class==3, "c",
ifelse(class==4, "d",
ifelse(class==5,"e",
ifelse(class==6,"f",
ifelse(class==7, "g",
"h"))))))),
levels=c("a","b","c","d","e","f","g","h")))
z_formatted<-pivot_longer(
train_z_df,
cols=-c(time_series_id,class),
names_to = "time_index",
values_to = "Z",
)
z_formatted<-z_formatted%>%transform(time_index=as.integer(time_index))%>%
relocate(class, .after=Z)%>%
arrange(time_series_id,class)
combined<-cbind(x_formatted,y_formatted,z_formatted)
combined<-combined[c(1,2,3,7,11,12)]
ts_Position<-combined %>%
group_by(time_series_id) %>%
mutate(X_velocity = cumsum(X),
Y_velocity = cumsum(Y),
Z_velocity = cumsum(Z),
X_position = cumsum(X_velocity),
Y_position = cumsum(Y_velocity),
Z_position = cumsum(Z_velocity))
Firstly, the acceleration values are summed up cumulatively to get the velocity values. These velocity values obtained for three axes are plotted for 8 gesture types. The colors in the graphics change over time. The movement that starts with purple dots ends with a yellow point, and the colors between these colors are scaled again depending on the time. the yellow dot here can be thought as the end of the arrow. It is seen that the starting and ending points overlap by looking at all the plots showing the velocity values. However, this is not the case with the gesture types defined. For this reason, position values need to be calculated in order to associate these gesture types with data. The velocity values are added up cumulatively again to obtain the position values. In the last case, position values are obtained for all axes and when these values are plotted, results similar to the defined gesture types are obtained. One thing I pay attention to while establishing this relationship is that the defined types are 2-dimensional, but the position plots I get are 3-dimensional. Since there is such a difference, I made an assumption and thought as follows: If the 2-dimensional route defined can be seen from any angle on the 3-dimensional position plots, they are compatible. When this logic is applied for 8 gestures, it can be seen that the resulting position plots fit the defined gesture types. For a clearer view, plots can be zoomed in and out. Also, the resulting shape can be rotated, axis values can be viewed by clicking on it.
ts_id<-list(list(11,"a"),list(15,"b"),list(4,"c"),list(5,"d"),list(2,"e"),list(1,"f"),list(7,"g"),list(6,"h"))
data<-ts_Position%>%filter(time_series_id==ts_id[[1]][[1]],class==ts_id[[1]][[2]])
plot1<-plot_ly(data,
x=~X_velocity, y=~Y_velocity, z=~Z_velocity,color =~ceiling(time_index/100),
type="scatter3d", mode="markers") %>%
layout(title = paste("Velocity Chart for a Gesture from Class",ts_id[[1]][[2]]))
plot1
plot2<-plot_ly(data,
x=~X_position, y=~Y_position, z=~Z_position,
color =~ceiling(time_index/100),
type="scatter3d", mode="markers")%>%
layout(title = paste("Position Chart for a Gesture from Class",ts_id[[1]][[2]]))
plot2
Since the position data is found in the previous part compatible, the operations are performed using the position values in this and the following parts. Using the position data of the x, y and z axes of all classes, PCA is performed and eigen vectors are obtained. As a result of PCA, it would be proper to take the first two components that give approximately 90 percent cumulative proportion, as it is desired to maintain the variance in the data by obtaining a high variance as possible. However, since the data is wanted to be reduced from 3 dimensional to 1 dimensional, only the first component is taken.
In the first chart, it is shown how much the variance obtained expresses the variance of the data.
In the second chart, the shares of the x, y and z axes that constitute the first two principal components are visualized.
##PART B
###PCA for all series combined
cor(ts_Position[,c(10,11,12)])
## X_position Y_position Z_position
## X_position 1.0000000 0.2732331 -0.1033031
## Y_position 0.2732331 1.0000000 0.5935095
## Z_position -0.1033031 0.5935095 1.0000000
pca_all <- princomp(as.matrix(ts_Position[,c(10,11,12)]),cor=T)
summary(pca_all,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.2724995 1.0380930 0.5505524
## Proportion of Variance 0.5397517 0.3592124 0.1010360
## Cumulative Proportion 0.5397517 0.8989640 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## X_position 0.209 0.908 0.362
## Y_position 0.723 0.105 -0.683
## Z_position 0.658 -0.404 0.635
PCA_vis<-PCA(ts_Position[,c(10,11,12)])
PCA_vis$eig
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 1.6192550 53.97517 53.97517
## comp 2 1.0776371 35.92124 89.89640
## comp 3 0.3031079 10.10360 100.00000
Since it is reduced to a single variable, each instance can be plotted as time series. It can be concretized how well the component defines the data and classes by selecting 2 random instances from each class and checking their compatibility.
ts_Position_uni<-cbind(ts_Position,PC1=pca_all$scores[,1])
classes<-c("a","b","c","d","e","f","g","h")
for(i in 1:8){
PCA_all_ts_a<-ts_Position_uni%>%filter(class==classes[i])%>%select(time_series_id,time_index,PC1,class)
random_ids<-sample (unique(PCA_all_ts_a$time_series_id), size=2, replace =F)
PCA_all_ts_a<-PCA_all_ts_a%>%
filter(time_series_id %in% random_ids)%>%
pivot_wider(.,names_from = time_series_id , values_from = PC1 ,names_prefix = "id")
plot<-ggplot(PCA_all_ts_a,aes_string(x=names(PCA_all_ts_a)[1]))+
geom_line(aes_string(y=names(PCA_all_ts_a)[3]),size=2,color="#CC6666")+
geom_line(aes_string(y=names(PCA_all_ts_a)[4]),size=2,color="#9999CC")+
theme_ipsum()+
xlab("Time")+
ylab("PC1")+
ggtitle(paste("Univariate Time Series from Class",classes[i]))
print(plot)
}
By looking at the results, time series are quite compatible in e and f classes. However, when the general evaluation is made, although the time series are compatible as patterns, it is seen that a gap occurs between them. Although it gives good results in e and f classes, it is a random situation to give good results because the selected instances are random.
We can say briefly that the PCA explains class patterns well.
When PCA is applied separately for each class, the first principal component differs between classes. The reason for this is that the weight of the movement on the axes changes in every gesture type. For example, when the a class, the first gesture type, is evaluated, the movement mostly occurs on the y and z axes as it can be seen from the visualization made in Part A. Accordingly, the coefficients of y and z in the first principal component are high for the first gesture.
In short, the first principal component has different axis coefficients in different gesture types because it depends on the weight of the movement in the related axes.
for(i in 1:8){
print(paste("PCA for the Class",classes[i]))
ts_Position_class<-ts_Position%>%filter(class==classes[i])
cor(ts_Position_class[,c(10,11,12)])
pca_class <- princomp(as.matrix(ts_Position_class[,c(10,11,12)]),cor=T)
print(summary(pca_class,loadings=TRUE))
PCA_vis_class<-PCA(ts_Position_class[,c(10,11,12)])
print(PCA_vis_class$eig)
}
## [1] "PCA for the Class a"
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.2932492 1.0091473 0.5559931
## Proportion of Variance 0.5574978 0.3394594 0.1030428
## Cumulative Proportion 0.5574978 0.8969572 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## X_position 0.119 0.973 0.198
## Y_position 0.693 -0.224 0.685
## Z_position 0.711 -0.701
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 1.6724935 55.74978 55.74978
## comp 2 1.0183782 33.94594 89.69572
## comp 3 0.3091283 10.30428 100.00000
## [1] "PCA for the Class b"
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.1286537 0.9457021 0.9120243
## Proportion of Variance 0.4246197 0.2981175 0.2772628
## Cumulative Proportion 0.4246197 0.7227372 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## X_position 0.535 0.814 0.224
## Y_position 0.613 -0.192 -0.767
## Z_position 0.581 -0.548 0.602
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 1.2738591 42.46197 42.46197
## comp 2 0.8943525 29.81175 72.27372
## comp 3 0.8317884 27.72628 100.00000
## [1] "PCA for the Class c"
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.2156489 1.0027730 0.7187794
## Proportion of Variance 0.4926008 0.3351846 0.1722146
## Cumulative Proportion 0.4926008 0.8277854 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## X_position 0.709 0.705
## Y_position 0.371 0.847 -0.380
## Z_position -0.599 0.532 0.599
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 1.4778023 49.26008 49.26008
## comp 2 1.0055538 33.51846 82.77854
## comp 3 0.5166439 17.22146 100.00000
## [1] "PCA for the Class d"
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.2855781 0.9914895 0.6035210
## Proportion of Variance 0.5509037 0.3276838 0.1214125
## Cumulative Proportion 0.5509037 0.8785875 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## X_position 0.702 0.711
## Y_position 0.674 0.285 -0.681
## Z_position -0.230 0.958 0.173
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 1.6527111 55.09037 55.09037
## comp 2 0.9830514 32.76838 87.85875
## comp 3 0.3642376 12.14125 100.00000
## [1] "PCA for the Class e"
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.4038307 0.9685920 0.30180949
## Proportion of Variance 0.6569135 0.3127235 0.03036299
## Cumulative Proportion 0.6569135 0.9696370 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## X_position 0.245 0.970
## Y_position 0.685 -0.177 0.707
## Z_position 0.686 -0.170 -0.708
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 1.97074054 65.691351 65.69135
## comp 2 0.93817049 31.272350 96.96370
## comp 3 0.09108897 3.036299 100.00000
## [1] "PCA for the Class f"
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.3400570 0.9638120 0.52470332
## Proportion of Variance 0.5985843 0.3096445 0.09177119
## Cumulative Proportion 0.5985843 0.9082288 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## X_position 0.291 0.955
## Y_position -0.671 0.251 -0.698
## Z_position -0.682 0.162 0.713
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 1.7957528 59.858428 59.85843
## comp 2 0.9289336 30.964453 90.82288
## comp 3 0.2753136 9.177119 100.00000
## [1] "PCA for the Class g"
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.2971922 0.8883869 0.7266781
## Proportion of Variance 0.5609026 0.2630771 0.1760203
## Cumulative Proportion 0.5609026 0.8239797 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## X_position 0.596 0.487 0.639
## Y_position 0.632 0.207 -0.747
## Z_position 0.496 -0.849 0.184
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 1.6827077 56.09026 56.09026
## comp 2 0.7892313 26.30771 82.39797
## comp 3 0.5280610 17.60203 100.00000
## [1] "PCA for the Class h"
## Importance of components:
## Comp.1 Comp.2 Comp.3
## Standard deviation 1.2203138 0.9683089 0.7571078
## Proportion of Variance 0.4963886 0.3125407 0.1910707
## Cumulative Proportion 0.4963886 0.8089293 1.0000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3
## X_position 0.664 0.253 0.704
## Y_position 0.337 -0.941
## Z_position 0.668 0.223 -0.710
## eigenvalue percentage of variance cumulative percentage of variance
## comp 1 1.4891657 49.63886 49.63886
## comp 2 0.9376221 31.25407 80.89293
## comp 3 0.5732122 19.10707 100.00000
First, the distance between time series is calculated using the Euclidean method and the symmetric distance matrix is obtained. Then, MDS is applied to this matrix and all time series are made two-dimensional. In other words, time series are expressed with 2 variables instead of multiple time indexes. In order to see the positions of the time series with respect to each other, they are plotted according to their binary values and colored with the gesture types.
dist_matrix<-dist(mds_data[,-c(1,2)],method = "euclidean")
MDS<-cmdscale(dist_matrix)
colnames(MDS) <- c("x","y")
MDS<-data.frame(MDS)
MDS<-cbind(mds_data[c(1,2)],MDS)
mds_plot<-ggplot(MDS,aes(x=x,y=y))+
geom_jitter(aes(color=class),size=3)+
scale_colour_brewer(palette ="Accent" )+
theme_minimal()+
labs(color="Gesture Class")
mds_plot
mds_plot+facet_wrap(~class)
For MDS to be successful, the same gesture types must be close to each other. According to plots above obtained as a result of MDS, this method gives very successful results for the gesture types a, b and g, while less successful results for the c, d and f groups in a wider area. However, time series belonging to the same class are not grouped in different regions, they display a collective image. For this reason, MDS becomes successful in making this classification.